[SOUND] This lecture is about
evaluation of text retrieval systems.
In the previous lectures, we have talked
about a number of text retrieval methods.
Different kinds of ranking functions.
But how do we know which
one works the best?
In order to answer this question,
we have to compare them,
and that means we'll have to
evaluate these retrieval methods.
So this is the main topic of this lecture.
First, let's think about why
do we have to do evaluation?
I already gave one reason.
And that is,
we have to use evaluation to figure out
which retrieval method works better.
Now this is very important for
advancing our knowledge.
Otherwise we wouldn't know whether
a new idea works better than an old idea.
In the beginning of this
course, we talked about
the problem of text retrieval and
compared it with database retrieval.
There, we mentioned that text retrieval
is an empirically defined problem.
So evaluation must rely on users.
Which system works better
would have to be judged by our users.
So this becomes a very challenging problem,
because how can we get users involved
in the evaluation, and
how can we draw a fair
comparison of different methods?
So let's go back to the reasons for
evaluation.
I listed two reasons here.
The second reason is basically what I just
said, but there is also another reason,
which is to assess the actual
utility of a text retrieval system.
Now imagine you're building
your own applications.
You would be interested in knowing how well
your search engine works for your users.
So in this case, measures must
reflect the utility to the actual
users in the real application.
And typically, this has been
done by using user studies and
using the real search engine.
In the second case, or for
the second reason, the measures
actually only need to be correlated
with the utility to actual users.
Thus they don't have to accurately
reflect the exact utility to users.
So the measure only needs to be good
enough to tell which method works better.
And this is usually done
through a test collection.
And this is the main idea that we'll
be talking about in this course.
This has been very important for
comparing different algorithms and
for improving search
engine systems in general.
So next we will talk
about what to measure.
There are many aspects of a search engine
we can measure, we can evaluate and
here I list the three major aspects.
One is effectiveness or accuracy,
how accurate are the search results?
In this case we're measuring a system's
capability of ranking relevant documents
on top of non-relevant ones.
The second is efficiency.
How quickly can a user get the results?
How much computing resources
are needed to answer a query?
So in this case we need to measure
the space and time overhead of the system.
The third aspect is usability.
Basically the question is how useful
is the system for real user tasks?
Here, obviously, interfaces and
many other things are also important and
we typically would have
to do user studies.
Now, in this course, we're going to talk
mostly about the effectiveness and
accuracy measures, because
the efficiency and
usability dimensions are not really
unique to search engines;
they are needed for
evaluating any other software system.
And there is also good coverage of
such materials in other courses.
But how to evaluate a search engine's
accuracy is
something unique to text retrieval, and
we're going to talk a lot about this.
The main idea that people have proposed
for using a test set to evaluate
a text retrieval algorithm is called
the Cranfield Evaluation Methodology.
This one actually was developed a long
time ago, in the 1960s.
It's a methodology for laboratory testing
of system components. It's actually
a methodology that has been very useful,
not just for search engine evaluation,
but also for evaluating virtually
all kinds of empirical tasks.
For example, in processing or
in other fields where the problem
is empirically defined, we typically would
need to use such a methodology.
And today, with the big data challenge and
the use of machine learning everywhere,
this methodology has in general been very
popular, but it was first developed for
search engine applications in the 1960s.
So the basic idea of this approach is
to build reusable test collections
and define measures.
Once such a test collection is
built, it can be used again and
again to test different algorithms.
And we're going to define measures
that would allow you to quantify
the performance of a system or
an algorithm.
So how exactly would this work?
Well, we're going to
have a sample collection of documents, and
this is just similar to a real document
collection in your search application.
We're also going to have a sample
set of queries or topics.
This is to simulate the user's queries.
Then we'll have to have
relevance judgments.
These are judgments of which documents
should be returned for which queries.
Ideally, they have to be made by
the users who formulated the queries,
because those are the people that know
exactly what documents would be useful.
And then finally, we have to have measures
to quantify how well a system's result
matches the ideal ranked list
that would be constructed
based on users' relevance judgments.
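As a rough sketch of the components just listed, a test collection could be represented in Python along the following lines. The class name `TestCollection`, the field name `qrels`, and all document and query texts are illustrative assumptions, not something prescribed by the Cranfield methodology.

```python
from dataclasses import dataclass, field


@dataclass
class TestCollection:
    """A minimal, hypothetical container for a Cranfield-style test collection."""
    documents: dict[str, str]   # document ID -> document text
    queries: dict[str, str]     # query ID -> query text
    # Binary relevance judgments: query ID -> set of document IDs judged relevant.
    qrels: dict[str, set[str]] = field(default_factory=dict)

    def is_relevant(self, query_id: str, doc_id: str) -> bool:
        """Return True if doc_id was judged relevant to query_id."""
        return doc_id in self.qrels.get(query_id, set())


# Toy data, made up purely for illustration.
collection = TestCollection(
    documents={
        "D1": "evaluating text retrieval systems",
        "D2": "ranking functions for retrieval",
        "D3": "database query languages",
    },
    queries={"Q1": "text retrieval evaluation"},
    qrels={"Q1": {"D1", "D2"}},
)

print(collection.is_relevant("Q1", "D1"))  # judged relevant
print(collection.is_relevant("Q1", "D3"))  # not judged relevant
```

Storing only the relevant documents per query keeps the judgments compact; anything not listed is treated as non-relevant, which is itself a simplifying assumption.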
So this methodology is very useful for
studying retrieval
algorithms, because the test collection
can be reused many times.
And it will also provide a fair
comparison for all the methods.
We have the same criteria and the
same data set to use
to compare different algorithms.
This allows us to compare a new
algorithm with an old algorithm
that was invented many years ago,
by using the same standard.
So this is an illustration
of how this works.
As I said,
we need queries, which are shown here.
We have Q1, Q2, et cetera.
We also need documents, and
that's called the document collection.
And on the right side,
you see we need relevance judgments.
These are basically binary judgments
of documents with respect to a query.
So, for example, D1 is judged
as being relevant to Q1,
D2 is judged as being relevant as well,
and D3 is judged as non-relevant
to Q1, et cetera.
These would be created by users.
Once we have these,
we basically have a test collection.
Then, if you have two systems
you want to compare,
you can just run each
system on these queries and
documents, and
each system will then return results.
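A minimal sketch of this step in Python might look as follows. The interface (a system is a function from a query and the documents to a list of document IDs), the two toy matching strategies, and the sample texts are all assumptions made for illustration; they are not the systems shown on the slide.

```python
from typing import Callable

# Assumed interface: a "system" maps (query text, documents) to a list of doc IDs.
System = Callable[[str, dict[str, str]], list[str]]


def run_systems(systems: dict[str, System],
                queries: dict[str, str],
                documents: dict[str, str]) -> dict[str, dict[str, list[str]]]:
    """Run every system on every query; return {system name: {query ID: results}}."""
    return {
        name: {qid: system(text, documents) for qid, text in queries.items()}
        for name, system in systems.items()
    }


def system_a(query: str, documents: dict[str, str]) -> list[str]:
    """Toy system: return documents that contain every query word."""
    words = query.lower().split()
    return [doc_id for doc_id, text in documents.items()
            if all(w in text.lower().split() for w in words)]


def system_b(query: str, documents: dict[str, str]) -> list[str]:
    """Toy system: return documents that contain at least one query word."""
    words = query.lower().split()
    return [doc_id for doc_id, text in documents.items()
            if any(w in text.lower().split() for w in words)]


# Made-up documents and query, shared by both systems.
documents = {
    "D1": "text retrieval evaluation",
    "D2": "retrieval evaluation methods",
    "D3": "database retrieval",
}
queries = {"Q1": "retrieval evaluation"}

results = run_systems({"A": system_a, "B": system_b}, queries, documents)
print(results["A"]["Q1"])
print(results["B"]["Q1"])
```

Because both systems see exactly the same queries and documents, any difference in their output comes from the systems themselves, which is what makes the comparison fair.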
Let's say the query is Q1;
then we would have the results here.
Here I show R sub A as
the results from system A.
So remember, we talked about the
task of computing an approximation of the
relevant document set.
So R sub A is
system A's approximation here, and
R sub B is system B's approximation
of the relevant documents.
Now let's take a look at these results.
So which is better?
Now imagine you are
a user. Which one would you like?
All right, let's take
a look at both results.
There are some differences, and
there are some documents that
are returned by both systems.
But if you look at the results,
you will feel that, well,
maybe A is better in the sense that
we don't have many non-relevant documents.
And among the three documents returned,
two of them are relevant, so
that's good, it's precise.
On the other hand, one can also
say maybe B is better because
we've got more relevant documents;
we've got three instead of two.
So which one is better and
how do we quantify this?
Well obviously, this question
highly depends on a user's task.
And, it depends on users as well.
You might be able to imagine that, for
some users, maybe system A is better,
if the user is not interested in
getting all the relevant documents.
In this case, the user
doesn't have to read a lot, and
would see mostly relevant documents.
On the other hand,
one can also imagine a user who might need
as many relevant documents as possible,
for example, when doing a literature survey.
You might be in the second category, and
then you might find
that system B is better.
So in either case, we'll have to also
define measures that would quantify them.
And we might need to define
multiple measures because
users have different perspectives
of looking at results.
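To make the two perspectives concrete, here is a small Python sketch that computes two simple quantities for each result list: the fraction of returned documents that are relevant, and the number of relevant documents found. The lecture only states that system A returns three documents with two relevant, and that system B finds three relevant documents; the specific document IDs, the length of B's list (five), and the judgment for D5 are assumptions made for this example.

```python
def precision(results: list[str], relevant: set[str]) -> float:
    """Fraction of returned documents that are judged relevant."""
    if not results:
        return 0.0
    return sum(1 for doc_id in results if doc_id in relevant) / len(results)


def num_relevant_retrieved(results: list[str], relevant: set[str]) -> int:
    """Number of distinct relevant documents that appear in the results."""
    return len(set(results) & relevant)


# Hypothetical judgments for Q1: D1 and D2 are relevant as in the lecture;
# D5 being relevant is an assumption for this sketch.
relevant_q1 = {"D1", "D2", "D5"}

# Hypothetical result lists for the two systems.
results_a = ["D2", "D1", "D4"]               # 3 returned, 2 relevant
results_b = ["D1", "D4", "D3", "D5", "D2"]   # 5 returned, 3 relevant

for name, results in [("A", results_a), ("B", results_b)]:
    print(name,
          round(precision(results, relevant_q1), 2),
          num_relevant_retrieved(results, relevant_q1))
```

Under these assumed lists, A scores higher on the first quantity and B on the second, which matches the point that a single number may not capture what every user cares about.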
[MUSIC]

